In typical reinforcement learning (RL), the environment is assumed to be given, and the goal of learning is to identify an optimal policy for the agent taking actions through its interactions with that environment. In this paper, we extend this setting by considering an environment that is not given but is instead controllable and learnable through its interaction with the agent. Theoretically, we identify a Markov decision process (MDP) with respect to the environment that is dual to the MDP with respect to the agent, and show that solving the dual MDP-policy pair yields a policy-gradient solution for optimizing the parametrized environment. Furthermore, environments with discontinuous parameters are addressed by a proposed general generative framework. While the idea is illustrated with an extended two-agent rock-paper-scissors game, our experiments on a Maze game design task show the effectiveness of the proposed algorithm in generating diverse and challenging Mazes against different agents with various settings.
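The core idea of treating the environment itself as the learner can be sketched with a minimal score-function (REINFORCE-style) update on environment parameters. The following is a hypothetical illustration, not the paper's actual algorithm: an environment parametrized by `theta` chooses moves in a rock-paper-scissors game against a fixed agent, and updates `theta` by policy gradient to maximize its own reward. All names (`softmax`, `env_reward`, `train`) and the fixed always-rock agent are assumptions made for this sketch.

```python
import math
import random

def softmax(theta):
    # Numerically stable softmax over the environment's parameters.
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def agent_policy():
    # Fixed agent for illustration: always plays "rock" (action 0).
    return 0

def env_reward(env_action, agent_action):
    # The environment is rewarded when its move beats the agent's:
    # (agent_action + 1) % 3 is the winning counter-move.
    return 1.0 if env_action == (agent_action + 1) % 3 else 0.0

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]  # environment parameters over {rock, paper, scissors}
    for _ in range(steps):
        probs = softmax(theta)
        a_env = rng.choices(range(3), weights=probs)[0]
        r = env_reward(a_env, agent_policy())
        # Score-function gradient: d/dtheta_i log pi(a) = 1{i == a} - probs[i]
        for i in range(3):
            grad = (1.0 if i == a_env else 0.0) - probs[i]
            theta[i] += lr * r * grad
    return softmax(theta)

probs = train()
print(probs)  # probability mass should concentrate on "paper" (index 1)
```

Against the always-rock agent, the environment's distribution concentrates on "paper", the winning response, illustrating how an environment can be optimized by policy gradient against a fixed agent.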